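We propose a new method for computing error rates in automatic speech recognition (ASR). The new metric targets languages that contain half characters and in which the same character can be written in different forms. We implement our methodology for Hindi, one of the main languages in the Indic context, and we believe the approach extends to other similar languages with large character sets. We call our metrics Alternate Word Error Rate (AWER) and Alternate Character Error Rate (ACER). We train our ASR models using wav2vec 2.0 \cite{baevski2020wav2vec}. In addition, we use language models to improve model performance. Our results show a significant improvement in analyzing error rates at the word and character level, and the interpretability of the ASR system improves by up to $3$\% in AWER and $7$\% in ACER for Hindi. Our experiments suggest that in languages with complex pronunciation there are multiple ways of writing a word without changing its meaning. In such cases, AWER and ACER are more useful metrics than WER and CER. Furthermore, we open-source a new 21-hour benchmark dataset for Hindi along with the new metric scripts.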
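The idea behind an alternate-form error rate can be illustrated with a minimal sketch. It is not the authors' released metric script: it assumes a hypothetical lookup table `ALTERNATES` that maps each accepted spelling variant to one canonical form, and it applies that mapping to both reference and hypothesis before computing the usual edit-distance-based rate. The function names and the example word pair are illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Hypothetical table of accepted spelling variants (assumption, not from the paper).
ALTERNATES: Dict[str, str] = {"हिन्दी": "हिंदी"}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token (or character) sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def canonicalize(words: List[str], alternates: Dict[str, str]) -> List[str]:
    """Replace every known variant with its canonical spelling."""
    return [alternates.get(w, w) for w in words]


def wer(ref: str, hyp: str) -> float:
    """Standard word error rate: word-level edits divided by reference length."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def awer(ref: str, hyp: str, alternates: Dict[str, str]) -> float:
    """Word error rate after mapping alternate spellings to one form."""
    ref_words = canonicalize(ref.split(), alternates)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    hyp_words = canonicalize(hyp.split(), alternates)
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def acer(ref: str, hyp: str, alternates: Dict[str, str]) -> float:
    """Character error rate (over code points) after the same mapping."""
    ref_text = " ".join(canonicalize(ref.split(), alternates))
    if not ref_text:
        raise ValueError("reference must contain at least one character")
    hyp_text = " ".join(canonicalize(hyp.split(), alternates))
    return edit_distance(ref_text, hyp_text) / len(ref_text)


reference = "मुझे हिंदी पसंद है"
hypothesis = "मुझे हिन्दी पसंद है"
plain = wer(reference, hypothesis)                    # counts the variant as an error
alternate = awer(reference, hypothesis, ALTERNATES)   # treats the variant as correct
```

In this sketch the plain WER penalizes a valid alternate spelling, while the alternate-aware rate does not. That is the kind of gap the abstract describes between WER/CER and AWER/ACER.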
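We study the effect of applying a language model (LM) to the output of automatic speech recognition (ASR) systems for Indic languages. We fine-tune wav2vec $2.0$ models for $18$ Indic languages and adjust the results with language models trained on text derived from a variety of sources. Our findings show that the average character error rate (CER) decreases by $28$\% and the average word error rate (WER) decreases by $36$\% after decoding with an LM. We show that a large LM may not provide a substantial improvement compared with a diverse one. We also demonstrate that high-quality transcriptions can be obtained on domain-specific data without retraining the ASR model, and we show results for the biomedical domain.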
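Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language-specific. Training multilingual systems for Indic languages is even harder because of the lack of open-source datasets and of published results for different approaches. We compare the performance of an end-to-end multilingual speech recognition system with that of monolingual models conditioned on language identification (LID). The decoding information from the multilingual model is used for language identification and is then combined with the monolingual models to obtain a 50% improvement in WER across languages. We also propose a similar technique to address the code-switching problem and achieve a WER of 21.77 and 28.27 on Hindi-English and Bengali-English, respectively. Our work discusses how transformer-based ASR, in particular wav2vec 2.0, can be applied to develop multilingual ASR and code-switched ASR for Indic languages.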
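We present Vakyansh, an end-to-end toolkit for speech recognition in Indic languages. India is home to almost 121 languages and around 1.25 billion speakers. Yet most of these languages are low-resource in terms of data and pretrained models. Through Vakyansh, we introduce automatic data pipelines for data creation, model training, model evaluation, and deployment. We create 14,000 hours of speech data in 23 Indic languages and train wav2vec 2.0 based pretrained models. These pretrained models are then fine-tuned to create state-of-the-art speech recognition models for 18 Indic languages, followed by language models and punctuation restoration models. We open-source all of these resources with the mission of inspiring the speech community to develop speech-first applications in Indic languages using our ASR models.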
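We present CLSRIL-23, a self-supervised audio pretrained model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to assess the effects of monolingual and multilingual pretraining. We also compare performance on several downstream fine-tuning tasks for speech recognition, and our experiments show that multilingual pretraining outperforms monolingual pretraining, both in learning speech representations that encode the phonetic similarity of languages and in performance on downstream tasks. When a multilingual pretrained model is used for fine-tuning in Hindi, we observe a 5% reduction in WER and a 9.5% reduction in CER. All code and models are open-sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research on speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.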
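For reference, the contrastive task mentioned above can be written out as it is formulated in the wav2vec 2.0 paper. The abstract itself does not give the formula, so the notation here is carried over from that paper: $\mathbf{c}_t$ is the context network output at a masked time step $t$, $\mathbf{q}_t$ is the true quantized latent, $\mathbf{Q}_t$ is the candidate set containing $\mathbf{q}_t$ and $K$ distractors, $\kappa$ is a temperature, and $\mathcal{L}_d$ is the codebook diversity loss weighted by $\alpha$.

```latex
\mathcal{L}_m = -\log
  \frac{\exp\bigl(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\bigr)}
       {\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t}
        \exp\bigl(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\bigr)},
\qquad
\mathrm{sim}(\mathbf{a}, \mathbf{b}) =
  \frac{\mathbf{a}^{\top}\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},
\qquad
\mathcal{L} = \mathcal{L}_m + \alpha\,\mathcal{L}_d
```

In a multilingual setting such as the one described here, the quantizer codebooks are shared across all languages, so the same candidate set construction applies regardless of which language an utterance comes from.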
When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task ``features'' -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use, so we can learn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We employ self-supervised representation learning via a training strategy that adapts off-the-shelf video features using a temporal module. Training implements self-supervised learning losses involving multiple cues such as appearance, motion, and pose trajectories extracted from videos to learn generalizable representations. Our method extracts key steps via a tunable algorithm that clusters the representations extracted from procedural videos. We quantitatively evaluate our approach with key step localization and also demonstrate the effectiveness of the extracted representations on related downstream tasks like phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the procedural tasks.
An oft-cited open problem of federated learning is the existence of data heterogeneity at the clients. One pathway to understanding the drastic accuracy drop in federated learning is by scrutinizing the behavior of the clients' deep models on data with different levels of "difficulty", which has been left unaddressed. In this paper, we investigate a different and rarely studied dimension of FL: ordered learning. Specifically, we aim to investigate how ordered learning principles can contribute to alleviating the heterogeneity effects in FL. We present theoretical analysis and conduct extensive empirical studies on the efficacy of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random curriculum. We find that curriculum learning largely alleviates non-IIDness. Interestingly, the more disparate the data distributions across clients, the more they benefit from ordered learning. We provide analysis explaining this phenomenon, specifically indicating how curriculum training appears to make the objective landscape progressively less convex, suggesting fast converging iterations at the beginning of the training procedure. We derive quantitative results of convergence for both convex and nonconvex objectives by modeling the curriculum training on federated devices as local SGD with locally biased stochastic gradients. Also, inspired by ordered learning, we propose a novel client selection technique that benefits from the real-world disparity in the clients. Our proposed approach to client selection has a synergistic effect when applied together with ordered learning in FL.
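The three orderings named above can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the paper's procedure: the per-sample difficulty scores are taken as given (the abstract does not say how difficulty is measured), the client model is a hypothetical one-parameter least-squares fit, and `order_indices` and `local_sgd` are names introduced here.

```python
import random
from typing import List, Sequence


def order_indices(difficulty: Sequence[float], mode: str, seed: int = 0) -> List[int]:
    """Return the order in which a client visits its local samples.

    difficulty[i] is an assumed per-sample difficulty score (higher = harder).
    """
    indices = list(range(len(difficulty)))
    if mode == "curriculum":        # easy samples first
        return sorted(indices, key=lambda i: difficulty[i])
    if mode == "anti-curriculum":   # hard samples first
        return sorted(indices, key=lambda i: difficulty[i], reverse=True)
    if mode == "random":            # difficulty is ignored
        random.Random(seed).shuffle(indices)
        return indices
    raise ValueError(f"unknown ordering mode: {mode}")


def local_sgd(w: float, xs: Sequence[float], ys: Sequence[float],
              order: Sequence[int], lr: float = 0.1) -> float:
    """One local pass of per-sample SGD on the squared error (w * x - y) ** 2."""
    for i in order:
        grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
        w -= lr * grad
    return w


# Toy client data: the third sample is an outlier, so it is scored as hardest.
xs = [1.0, 1.0, 1.0]
ys = [2.0, 2.1, 5.0]
scores = [0.1, 0.2, 0.9]  # assumed difficulty scores

w_curriculum = local_sgd(0.0, xs, ys, order_indices(scores, "curriculum"))
w_anti = local_sgd(0.0, xs, ys, order_indices(scores, "anti-curriculum"))
```

Because each pass visits samples in a fixed, non-uniform order, the local gradients are not unbiased estimates of the client's full gradient. This matches the abstract's choice to model curriculum training as local SGD with locally biased stochastic gradients.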
This paper tackles the challenging problem of automating code updates to fix deprecated API usages of open source libraries by analyzing their release notes. Our system employs a three-tier architecture: first, a web crawler service retrieves deprecation documentation from the web; then a specially built parser processes those text documents into tree-structured representations; finally, a client IDE plugin locates and fixes identified deprecated usages of libraries in a given codebase. The focus of this paper in particular is the parsing component. We introduce a novel transition-based parser in two variants: based on a classical feature engineered classifier and a neural tree encoder. To confirm the effectiveness of our method, we gathered and labeled a set of 426 API deprecations from 7 well-known Python data science libraries, and demonstrated our approach decisively outperforms a non-trivial neural machine translation baseline.
Using a comprehensive sample of 2,585 bankruptcies from 1990 to 2019, we benchmark the performance of various machine learning models in predicting financial distress of publicly traded U.S. firms. We find that gradient boosted trees outperform other models in one-year-ahead forecasts. Variable permutation tests show that excess stock returns, idiosyncratic risk, and relative size are the more important variables for predictions. Textual features derived from corporate filings do not improve performance materially. In a credit competition model that accounts for the asymmetric cost of default misclassification, the survival random forest is able to capture large dollar profits.